Abstract

A jet algorithm based on the k-means clustering procedure is proposed which can be
used for the invariant-mass reconstruction of heavy states decaying to hadronic jets.
The proposed algorithm was tested by reconstructing the $e^+e^- \to t\bar{t} \to 6$ jets
and $e^+e^- \to W^+W^- \to 4$ jets processes at $\sqrt{s} = 500$ GeV using a Monte Carlo
simulation. It was shown that the algorithm has a reconstruction efficiency similar to
traditional jet-finding algorithms, and leads to 25% and 40% improvement of the
top-quark and W mass resolution, respectively, compared to the $k_T$ (Durham)
algorithm. In addition, it is expected that the peak positions measured with the new
algorithm have a smaller systematic uncertainty.


^Also affiliated with DESY, Notkestrasse 85, 22607, Hamburg, Germany 



1 Introduction 


Jet-finding algorithms are indispensable tools for the reconstruction of heavy states
(Z, W bosons, top quarks, Higgs bosons) decaying to hadronic jets. A number of
jet algorithms have been proposed in the past (see recent reviews [1,2]) which can be
used for the calculation of the invariant-mass distributions for hadronically decaying
heavy states.

It has already been pointed out [1] that there is no algorithm which is optimal
for all possible jet-related studies. Usually, different jet algorithms have different
emphases. Some jet finders are preferable for precise comparisons with QCD theory,
since the jet cross sections reconstructed with such algorithms have small fixed-order
perturbative corrections, as well as small hadronisation corrections. However, such
jet algorithms may not be optimal for other tasks.

The traditional jet finders have one significant drawback: misassignment of
hadrons to jets is a common problem for the reconstruction of heavy states decaying
into jets. Incorrectly assigned particles lead to a broadening of the invariant-mass
peaks, as well as to a reduction of signal-over-background ratios. To deal with this
problem, one can impose expected kinematic criteria on the reconstructed jets.
However, the construction of the traditional algorithms prevents such criteria from
being included in an efficient way: the iterative procedure which combines particles
into jets is usually based on a single distance measure between particles. Therefore,
it is difficult to take into account a priori known information on decay kinematics
during the jet clustering procedure.

To solve the misassignment problem, one may think about an iterative procedure
which would keep redistributing hadrons between jets until known kinematic
criteria are met. In this case, the main question is which particles should be
redistributed (particles in jets with the strongest overlaps?) and what
“particle-redistribution algorithm” should be used for this, keeping in mind that
such a procedure should be reasonably fast.

Below we discuss an algorithm which attempts to solve the problem of particle
misassignments. In fact, we propose a jet clustering procedure with some additional
elements of intelligence: it minimises not only a distance measure between hadrons,
but also any physics-related quantity reflecting how close the final event kinematics
is to the expected one. To illustrate its properties, we consider
$e^+e^- \to t\bar{t} \to b\bar{b}W^+W^- \to 6$ jets and $e^+e^- \to W^+W^- \to 4$ jets
decays at $\sqrt{s} = 500$ GeV. We have chosen these processes due to their simplicity,
since the event signatures are characterised by the production of exactly six (four)
hadronic jets. The all-hadronic top decay is also considered to be the most promising
for top studies at the International Linear Collider (ILC), since this channel has the
largest branching ratio ($\sim 44\%$ of all $t\bar{t}$ decays).
2 k-means clustering algorithm

We recall that k-means [3] clustering is among the oldest (and simplest) unsupervised
learning algorithms that solve clustering problems. It has been adapted to classify
data in many problem domains. The k-means procedure is outlined below.

Let us assume that we have N particles and we know that all these particles
should be grouped into a fixed number $N_{cl}$ of clusters. The main idea is to define
the locations of the initial $N_{cl}$ centroids, or center points, in a certain phase
space. These centroids should be placed as far away from each other as possible. The
next step is to associate each point belonging to a given data set with the nearest
centroid. In the simplest approach, one could use a minimum-distance classifier
to assign all particles to such centroids. Once this assignment is done, the
positions of the new centroids are recalculated. This procedure is repeated in a
loop. As a result of this iteration, the centroids change their locations step by step
until they do not move any more. For the final cluster configuration, each data point
is associated with the closest centroid.

The grouping is usually done by minimising the sum of squares of the distances
between data points and the corresponding cluster centroid, although other choices
are also possible. For this simplest choice of the metric, the algorithm minimises
the quantity:

$$S = \sum_{k=1}^{N_{cl}} \sum_{n \in L_k} |X_n - C_k|^2 \qquad (1)$$

where $X_n$ is a vector representing the data point and $C_k$ is the geometric location
of the cluster center in the subset $L_k$ (i.e. the data points associated with the
$k$th cluster centroid). It can be proved that the k-means procedure always terminates
for this metric. However, the k-means algorithm does not necessarily find the
optimal configuration, and it is significantly sensitive to the initial, randomly
selected, centroid locations. Thus the algorithm should be run multiple times to
reduce this instability effect.
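
The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used for the results in this paper: the function names are hypothetical, the initial centroids are assumed to be drawn from the data points themselves, and an empty cluster simply keeps its previous centroid.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, rng, max_iter=100):
    """Run one k-means pass and return (S, centroids, labels).

    S is the sum of squared point-to-centroid distances, as in Eq. (1).
    """
    # Assumption of this sketch: seeds are random data points.
    centroids = rng.sample(points, n_clusters)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(n_clusters), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids no longer move
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    s = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return s, centroids, labels


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
s, centroids, labels = kmeans(points, 2, random.Random(0))
```

For this one-dimensional toy input every choice of seeds ends in the same configuration. In general the result depends on the seeds, which is why repeated runs are needed.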

The last feature could help to construct an “intelligent” algorithm which minimises
not only a distance measure between particles and the centroids (i.e. jet
centers), but also any physics-related optimisation criteria. To be more specific, let
us consider an example which is relevant for high-energy physics: the
$e^+e^- \to t\bar{t} \to b\bar{b}W^+W^- \to 6$ jets process. In accordance with the
topology of such events, we expect that all hadrons should be clustered into six jets.
Thus, six centroids (i.e. jet seeds) randomly located in a phase space should be
specified for the initial k-means clustering loop. The clustering can be performed by
minimizing the distances from the centroids to hadrons in the azimuthal angle ($\phi$)
and rapidity ($y$) phase space. After the end of the initial iterative procedure, the
cluster topology can be characterised by the sum $S$ of the distances from the centers
of the jets to hadrons, as given by Eq. (1). The procedure should be repeated $K$ times
using different starting locations for the centroids. This gives $K$ solutions with the
final values of the metric $S_1, \ldots, S_K$. The number $K$ should be large enough to
make sure that there are several configurations with the same $S_i$. This gives
confidence that all possible configurations were explored and that an absolute
minimum can be found. If there are several final configurations with the smallest
$S_i$ (which are exactly the same), then one could say that a hadron assignment with
the strongest particle collimation inside jets is found. It can be characterised by
$S_{min}$.
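
The selection of $S_{min}$ among the $K$ restarts can be illustrated as follows. This is a minimal sketch under stated assumptions: the helper names and the floating-point tolerance `tol` are hypothetical, and each restart is assumed to be summarised by a pair ($S_i$, cluster labels).

```python
def canonical(labels):
    """Relabel clusters by order of first appearance.

    Cluster indices are arbitrary, so [1, 1, 0] and [0, 0, 1] describe the
    same partition and both map to (0, 0, 1).
    """
    mapping = {}
    return tuple(mapping.setdefault(lab, len(mapping)) for lab in labels)


def select_minimum(results, tol=1e-9):
    """Return (S_min, n_hits, partitions) for a list of (S_i, labels_i) pairs.

    n_hits counts the restarts that reached S_min (within tol), and
    partitions is the set of distinct particle assignments found there.
    """
    s_min = min(s for s, _ in results)
    at_min = [labels for s, labels in results if s - s_min <= tol]
    partitions = {canonical(labels) for labels in at_min}
    return s_min, len(at_min), partitions


# Three restarts: two reach the same minimum with permuted cluster indices.
results = [(1.0, [0, 0, 1, 1]), (100.0, [0, 1, 0, 1]), (1.0, [1, 1, 0, 0])]
s_min, n_hits, partitions = select_minimum(results)
```

A value of `n_hits` of at least two corresponds to the requirement in the text that several configurations share the same smallest $S_i$.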

Note that the final configuration is optimal from the point of view of the
closeness of hadrons to the central jet positions. Certainly, it may not be optimal
from the physics point of view, since some hadrons (located mostly at the
edge of the jets) could still be assigned to wrong jets. To minimise this problem, one
can use kinematic requirements already during the k-means clustering iterations.
In order to take into account known event kinematics, one could multiply $S_i$ by a
weight factor which reflects the likelihood of a certain cluster configuration from the
point of view of the expected physics output. The weight factor can be proportional
to $1 - P_i$, where $P_i$ is the probability of how close a particular cluster
configuration is to the expected one. For example, for the fully-hadronic $t\bar{t}$
production, $S_i$ should be reduced if there are at least two dijets in an event with
invariant masses close to the W-boson mass.

The traditional jet finders only minimise a certain distance measure between
particles. For such jet algorithms, once the particle assignment is done, the event
can either be accepted (if, for example, there are two jets with masses close to
the W mass for the all-hadronic top decays) or rejected (in the opposite case). Thus,
the event-kinematic requirements are completely external to and independent of the
jet-finding procedure. In contrast, such requirements are an essential part of the
proposed jet clustering. This means that the new algorithm keeps analysing the same
event by trying different final configurations until certain kinematic conditions are
satisfied. Events can only be rejected if it is not possible to find an assignment
of hadrons which meets the criterion of the closeness of hadrons to jet centers and
at the same time satisfies the expected physics requirements.

For a single event, the k-means minimisation procedure leads to different locations
of the jet centers, as well as to different assignments of particles to the jets.
Typically, the particle assignments obtained with different initial seeds do not
differ drastically from one another. Therefore, one could view the overall picture as a
redistribution of hadrons (mainly located in the regions of strongest jet overlaps)
between the jets with fixed centers for all k-means configurations which differ from
one another by their initial conditions.

If the produced jets are very well collimated, then one should expect a small
difference between the proposed k-means clustering and the standard jet-finding
algorithms: in this case all k-means cluster configurations with different initial
centroids should give identical results (i.e. all $S_i$ will be the same). In contrast,
the constrained k-means algorithm could outperform the standard algorithms for events
with broad and overlapping jets.
3 Top-quark production


3.1 Durham jet finder versus unconstrained k-means
clustering algorithm

To illustrate the method outlined above, we apply it to the all-hadronic top
decays in $e^+e^-$ annihilation at the centre-of-mass energy of $\sqrt{s} = 500$ GeV.
The PYTHIA 6.3 model [4] was used to generate one million fully inclusive $e^+e^-$
events, including $t\bar{t}$ production. This sample contains 14740 events with
fully-hadronic top decays. The default PYTHIA parameters were used for the simulation.
The initial-state photon radiation was included. The mass and the Breit-Wigner
width of the top quarks were set to the default values, 175 GeV and 1.39 GeV,
respectively. Particles with a lifetime of more than 3 cm were considered to be
stable. Neutrinos were removed from consideration. We require all reconstructed
jets to have energies above 10 GeV. In order to remove events with a large
fraction of neutrinos, we apply momentum and energy imbalance cuts similar
to those used in [5]:

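$$\left| \frac{E_{vis}}{\sqrt{s}} - 1 \right| < 0.07, \qquad \frac{|\sum_i p_{\parallel i}|}{\sum_i |p_i|} < 0.04, \qquad \frac{|\sum_i \vec{p}_{Ti}|}{\sum_i |p_i|} < 0.04, \qquad (2)$$

where $E_{vis}$ is the visible energy, $p_{\parallel i}$ ($p_{Ti}$) is the longitudinal
(transverse) component of the momentum of a final-state particle and the sums run over
all final-state particles.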
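
As an illustration, one reading of the cuts in Eq. (2) is the following: the visible-energy fraction, and the net longitudinal and transverse momenta normalised to the scalar momentum sum. The sketch below implements that reading. The function name, the (E, px, py, pz) layout and the choice of the z axis as the beam direction are assumptions made for this example only.

```python
import math


def passes_imbalance_cuts(particles, sqrt_s,
                          cut_energy=0.07, cut_long=0.04, cut_trans=0.04):
    """Check the energy and momentum imbalance of one event.

    particles: list of (E, px, py, pz) tuples in GeV for visible final-state
    particles; the beam is assumed to be along z.
    """
    e_vis = sum(p[0] for p in particles)
    sum_px = sum(p[1] for p in particles)
    sum_py = sum(p[2] for p in particles)
    sum_pz = sum(p[3] for p in particles)
    # Scalar sum of momentum magnitudes, used to normalise both imbalances.
    sum_abs_p = sum(math.sqrt(p[1] ** 2 + p[2] ** 2 + p[3] ** 2) for p in particles)
    if sum_abs_p == 0.0:
        return False  # no measurable momentum: reject the event
    energy_ok = abs(e_vis / sqrt_s - 1.0) < cut_energy
    long_ok = abs(sum_pz) / sum_abs_p < cut_long
    trans_ok = math.hypot(sum_px, sum_py) / sum_abs_p < cut_trans
    return energy_ok and long_ok and trans_ok


balanced = [(250.0, 0.0, 0.0, 250.0), (250.0, 0.0, 0.0, -250.0)]
missing = [(250.0, 0.0, 0.0, 250.0), (200.0, 0.0, 0.0, -200.0)]
```

Here `balanced` passes all three cuts at `sqrt_s = 500.0`, while `missing` fails because 10% of the energy is not visible.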

We do not use a detector simulation for the generated events, since such a study is
outside the scope of this paper. Here we address the issue of the reconstruction of
the invariant masses, which are smeared with respect to the true masses by the parton
shower and hadronisation effects. Also, for simplicity, no b-tagging requirement was
assumed.

First, the reconstruction was done using the traditional method: jets were
found using the exclusive mode of the $k_T$ (Durham) algorithm [6], requiring exactly
six jets for each event. Our choice of the Durham algorithm was motivated by
the fact that this jet finder is one of the best algorithms for the reconstruction of
jet invariant masses in $e^+e^-$ annihilation, as was illustrated using the W-mass
reconstruction example [1]. We use a C++ version of this jet algorithm [7]. The event
is accepted if there is at least one jet pair with the invariant mass $M_{jj}$ in the
range $M_W \pm 10$ GeV, where $M_W$ is the nominal mass of the W boson. Next, the dijets
which passed this cut were combined with the rest of the jets, and then all three-jet
combinations were plotted. Figure 1 (left) shows the corresponding trijet invariant
masses, $M_{jjj}$. The fit was performed using the Breit-Wigner function together with
a second-order polynomial for the background description. The reconstructed
Breit-Wigner width ($\sim 10$ GeV) is similar to that obtained when an alternative
approach for the top reconstruction was used [5]. The method discussed in Ref. [5]
does not use the assumption on the W mass.
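
The dijet selection and the trijet combinations described above can be sketched as follows. This is a hedged illustration rather than the analysis code: the function names are hypothetical, jets are assumed to be (E, px, py, pz) tuples in GeV, and the default value of `m_w` is only a placeholder for the nominal W mass.

```python
import itertools
import math


def invariant_mass(*jets):
    """Invariant mass of the summed four-vectors (E, px, py, pz)."""
    e, px, py, pz = (sum(j[i] for j in jets) for i in range(4))
    m2 = e * e - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against small negative rounding


def trijet_masses(jets, m_w=80.45, window=10.0):
    """Return M_jjj for every W-candidate dijet combined with each other jet."""
    masses = []
    for a, b in itertools.combinations(range(len(jets)), 2):
        # Keep only jet pairs with M_jj inside M_W +- window.
        if abs(invariant_mass(jets[a], jets[b]) - m_w) > window:
            continue
        for c in range(len(jets)):
            if c not in (a, b):
                masses.append(invariant_mass(jets[a], jets[b], jets[c]))
    return masses


jets = [(40.2, 40.2, 0.0, 0.0), (40.2, -40.2, 0.0, 0.0), (50.0, 0.0, 50.0, 0.0)]
masses = trijet_masses(jets)
```

In this toy input only the first two jets form a W candidate, so a single trijet mass is returned.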
Now let us consider the k-means algorithm. As a first step, the final-state hadrons
were pre-clustered with the Durham algorithm using $y_{cut} = 10^{-\ldots}$. This
procedure reduces the number of data points by a factor of 3-6. The average number of
subjets for $t\bar{t}$ production was around 20. As will be discussed below, this step
was necessary to reduce the computational time. The k-means algorithm was run
on the subjets. Each $e^+e^-$ event was analysed $K = 300$ times, every time using
different (random) locations for the initial centroids. This number was found to be
sufficiently large to explore all possible jet configurations.

The subjet clustering was performed in the rapidity and the azimuthal angle.
For k-means clustering, it is commonly accepted to normalize each variable by
its standard deviation. Therefore, both variables were normalized such that their
available range was approximately between 0 and 1. Without such a transformation,
the number of reconstructed states discussed below is 5-8% lower than
when the transformation is used.
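
A simple way to bring both variables to a comparable range is a linear rescaling, sketched below. The exact transformation used for the results in this paper is not specified beyond the target range, so this is only one possible choice: the function names, the rapidity bound `y_max` and the convention that $\phi$ lies in $[0, 2\pi)$ are assumptions of the example.

```python
import math


def to_unit_range(values, lo, hi):
    """Map values from the interval [lo, hi] linearly onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def normalise_subjets(subjets, y_max):
    """Rescale (y, phi) pairs so that both coordinates lie in [0, 1].

    Assumes |y| <= y_max and phi in [0, 2*pi). The periodicity of phi is
    ignored in this sketch.
    """
    ys = to_unit_range([y for y, _ in subjets], -y_max, y_max)
    phis = to_unit_range([phi for _, phi in subjets], 0.0, 2.0 * math.pi)
    return list(zip(ys, phis))


scaled = normalise_subjets([(0.0, math.pi), (-2.0, 0.0)], y_max=2.0)
```

After this step a unit distance means the same fraction of the available range in rapidity and in azimuthal angle.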

After the k-means clustering, each $e^+e^-$ event is characterised by the set $S_i$,
$i = 1, \ldots, K$, where $S_i$ denotes the sum of all distances from the centers of the
k-means jets to hadrons. Only jet configurations with the same smallest $S_i$ were
accepted. Typically, there are 10-20 final configurations which are characterised by
the same $S_{min}$. The result of the $M_{jjj}$ reconstruction is shown in Fig. 1
(right). The $M_{jjj}$ masses were plotted only for configurations characterised by the
minimum $S_{min}$. It can be seen that the k-means algorithm leads to a better mass
resolution (width) than the Durham jet finder. In addition, the reconstructed peak
position is closer to the generated top mass (175 GeV). An obvious drawback of the
standard k-means algorithm is a smaller reconstruction efficiency (i.e. a smaller
number of reconstructed events) than for the Durham algorithm, since the k-means
algorithm in its present form has a tendency to produce low-energy jets (< 10 GeV).
Below we discuss how to improve the k-means procedure.

3.2 Constrained k-means algorithm

Let us again consider the k-means algorithm, but this time we will constrain it by
some requirement: each $S_i$ will be multiplied by an additional weight factor. This
factor is constructed from several contributions:

1. The first factor reflects the closeness of two dijet invariant masses, $M_{jj}^{(1)}$
and $M_{jj}^{(2)}$, to the nominal W mass, $M_W$:

$$W_1 = W_a W_b, \qquad W_a = |M_{jj}^{(1)} - M_{jj}^{(2)}|, \qquad W_b = |\bar{M}_{jj} - M_W|,$$

where $\bar{M}_{jj} = (M_{jj}^{(1)} + M_{jj}^{(2)})/2$ represents the average invariant
mass of the two dijets. The factor $W_a$ gets small when there are two dijets with
similar invariant masses, while $W_b$ is reduced when the average mass of the two
dijets is close to the nominal W mass;
2. If there are two dijets with masses in the range $M_W \pm 10$ GeV, these
dijets have to be combined with the rest of the jets. This should lead to
several trijets which can be characterised by the invariant masses $M_{jjj}$. For
top production, it is expected that there are at least two trijets with similar
invariant masses, $M_{jjj}^{(1)}$ and $M_{jjj}^{(2)}$. Therefore, one can introduce
another factor:

$$W_2 = |M_{jjj}^{(1)} - M_{jjj}^{(2)}| / \bar{M}_{jjj},$$

where $\bar{M}_{jjj} = (M_{jjj}^{(1)} + M_{jjj}^{(2)})/2$ represents the average
invariant mass of the two trijets.

Each k-means cluster configuration can be characterised by the factor
$D_i = S_i W_{1,i} W_{2,i}$ (the new index $i$ in $W_{1,i}$ and $W_{2,i}$ denotes a
cluster configuration obtained using a certain initial position of the centroids).
Only configurations with the smallest $D_i$ were accepted. Since the clustering
procedure minimizes $D_i$, rather than $S_i$, the resulting particle assignment is
optimal not only from the point of view of how well hadrons are collimated in jets,
but also of how well such a cluster configuration reflects the expected $t\bar{t}$
decay properties.
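
The weighting can be illustrated with a short sketch. It assumes the following forms: $W_a$ is the absolute mass difference of the two dijets, $W_b$ is the distance of their mean from $M_W$, and $W_2$ is the trijet mass difference divided by the mean trijet mass. The function names and the dictionary layout of a configuration are hypothetical, and how the two dijets and two trijets are picked within a configuration is left to the caller.

```python
def w1_factor(m_jj_1, m_jj_2, m_w):
    """W_1 = W_a * W_b for two dijet masses (GeV)."""
    w_a = abs(m_jj_1 - m_jj_2)        # small for dijets of similar mass
    mean = 0.5 * (m_jj_1 + m_jj_2)
    w_b = abs(mean - m_w)             # small when the mean is close to M_W
    return w_a * w_b


def w2_factor(m_jjj_1, m_jjj_2):
    """W_2 = |M1 - M2| / mean for two trijet masses (GeV)."""
    mean = 0.5 * (m_jjj_1 + m_jjj_2)
    return abs(m_jjj_1 - m_jjj_2) / mean


def d_factor(config, m_w):
    """D_i = S_i * W_1 * W_2 for one cluster configuration."""
    return (config["S"]
            * w1_factor(*config["dijets"], m_w)
            * w2_factor(*config["trijets"]))


def best_configuration(configs, m_w):
    """Index of the configuration with the smallest D_i."""
    return min(range(len(configs)), key=lambda i: d_factor(configs[i], m_w))


configs = [
    {"S": 1.0, "dijets": (70.0, 90.0), "trijets": (150.0, 200.0)},
    {"S": 1.2, "dijets": (79.0, 81.0), "trijets": (172.0, 178.0)},
]
best = best_configuration(configs, m_w=80.45)
```

In this toy input the second configuration has a slightly larger $S_i$ but much smaller weight factors, so it is the one selected.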

The result of the constrained k-means algorithm is shown in Fig. 2 (left). While
the mass resolution and the systematic offset of the peak position are rather similar
to those of the unconstrained version of the algorithm, the efficiency of the
constrained algorithm is significantly higher. Fig. 2 (right) shows the invariant
masses for the background events (which do not contain top events). This invariant-mass
distribution does not show any structure near 175 GeV, indicating that the algorithm
does not produce a spurious peak there.

Although we do not think that the computational speed is an important issue
at the stage when no detector simulation is involved, a few words about the
performance of the proposed algorithm are still necessary. The (constrained)
k-means jet algorithm is a factor of two slower than the Durham jet finder. However,
the k-means algorithm requires an additional pre-clustering stage for which the
computational speed is rather similar to that for the reconstruction of six jets by
the Durham jet algorithm (all the jet algorithms discussed here were implemented in
C/C++). Thus, the k-means procedure is roughly three times slower than the Durham
algorithm. Without the pre-clustering stage, the k-means algorithm is a factor of
20-30 slower than the Durham algorithm for the reconstruction of six jets.


4 $W^+W^-$ production

As a second example, let us consider $e^+e^- \to W^+W^- \to 4$ jets at
$\sqrt{s} = 500$ GeV. 10k events were generated with PYTHIA using the same parameters
and selection as before. The W mass was set to 80.45 GeV and its width to 2.07 GeV. We
reconstructed exactly four jets and then plotted the invariant masses of all six jet
pairs. The k-means algorithm was constrained by a simple criterion:
$D_i = S_i W_{1,i}$, where $W_1 = |M_{jj}^{(1)} - M_{jj}^{(2)}| / \bar{M}_{jj}$ for each
k-means clustering.
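
For four jets there are six jet pairs and three ways to split the jets into two disjoint dijets. The sketch below lists both, together with the relative mass difference $|M_1 - M_2| / \bar{M}$ for each split. It is only an illustration: the function names and the (E, px, py, pz) jet layout are assumptions, and which split enters the weight is deliberately left open.

```python
import itertools
import math

# The three ways to split jets 0..3 into two disjoint pairs.
PAIRINGS = (((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2)))


def dijet_mass(j1, j2):
    """Invariant mass of two jets given as (E, px, py, pz) in GeV."""
    e, px, py, pz = (j1[i] + j2[i] for i in range(4))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def all_pair_masses(jets):
    """Invariant masses of all jet pairs (six for four jets)."""
    return [dijet_mass(a, b) for a, b in itertools.combinations(jets, 2)]


def w1_for_pairings(jets):
    """Relative dijet mass difference for each of the three pairings."""
    values = []
    for (a, b), (c, d) in PAIRINGS:
        m1 = dijet_mass(jets[a], jets[b])
        m2 = dijet_mass(jets[c], jets[d])
        values.append(abs(m1 - m2) / (0.5 * (m1 + m2)))
    return values


jets = [(40.0, 40.0, 0.0, 0.0), (40.0, -40.0, 0.0, 0.0),
        (45.0, 0.0, 45.0, 0.0), (45.0, 0.0, -45.0, 0.0)]
pair_masses = all_pair_masses(jets)
weights = w1_for_pairings(jets)
```

In this symmetric toy event a split other than the true one also yields two equal dijet masses. A small relative mass difference alone therefore does not identify the correct pairing.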

The results of the calculations are shown in Fig. 3. As before, the performance of
the k-means algorithm is superior to that of the Durham jet finder, especially for the
reconstructed width. One may note that the Breit-Wigner peak shown in Fig. 3 (right)
is also narrower than that for the invariant masses reconstructed with other
traditional jet-finding algorithms [1]. In addition, the systematic shift of the peak
position reconstructed with the k-means procedure is smaller than for the Durham
algorithm. However, the number of reconstructed W candidates is somewhat
smaller than for the Durham algorithm.

5 Conclusion 

A new jet clustering algorithm for the reconstruction of the invariant masses of
heavy states decaying to hadronic jets was proposed (the C/C++ code of the constrained
k-means algorithm is available as the module “kmeansjets.rmc” of the RunMC package
[8]). It is based on the k-means clustering procedure constrained by additional
kinematic requirements.

In this paper we did not try to cover many issues related to the use of this
algorithm. For example, we did not study how to apply this algorithm when no fixed
number of jets is expected, how to use it in theoretical calculations, whether it is
reliable in treating fixed-order perturbative QCD corrections and non-perturbative
effects and, finally, whether a realistic event reconstruction with all detector
effects included will benefit from its use. All such issues have to be addressed in
the future.

Note that the constrained k-means clustering has nothing to do with the constrained
fits used in the invariant-mass reconstruction: the constrained fit attempts
to find the optimal configuration when the error matrix of the measured quantities
is specified. The present approach does not require such input and it does
not address the issue of the experimental precision of the reconstructed jet energies
and their positions. Obviously, the constrained fit could also be used to improve the
reconstruction of heavy states from jet invariant masses.

For the proposed jet clustering, a priori specified physics requirements on event
kinematics can become an essential part of the minimisation procedure. In contrast,
the standard algorithms usually minimise a single distance measure. The
proposed algorithm has good reconstruction efficiency and leads to a significantly
better resolution for the invariant-mass reconstruction than the traditional Durham
jet finder. It is also expected that the peak positions measured with the new
algorithm have a small systematic uncertainty. Finally, the proposed k-means approach
can be used without any physics constraint (which only increases the reconstruction
efficiency), especially when the main issue is a good resolution of the invariant-mass
reconstruction.
Figure 1: The distribution of the trijet invariant masses for the reconstruction of
all-hadronic top decays. Fully inclusive $e^+e^-$ events were generated with PYTHIA for
$\sqrt{s} = 500$ GeV. The reconstruction was done using the $k_T$ algorithm (left) and
the k-means algorithm (right). The fit was performed using the Breit-Wigner function
together with a second-order polynomial to describe the background.


[Plot: two panels, each with horizontal axis M (GeV) spanning 130-210]


Figure 2: The trijet invariant masses for the all-hadronic top-decay channel. Fully
inclusive $e^+e^-$ events were generated with PYTHIA for $\sqrt{s} = 500$ GeV. The
reconstruction was done using the constrained k-means algorithm (left). The fit was
performed using the Breit-Wigner function together with a second-order polynomial
to describe the background. The invariant masses reconstructed with the same
algorithm using events without $t\bar{t}$ production do not show a spurious peak near
the nominal top mass (right plot).
Figure 3: The dijet reconstructed invariant masses for the all-hadronic W-decay
channel $e^+e^- \to W^+W^- \to 4$ jets. The events containing fully hadronic $W^+W^-$
decays were generated with PYTHIA for $\sqrt{s} = 500$ GeV. The reconstruction was
done using the Durham algorithm (left) and the constrained k-means algorithm
(right). The fit was performed using the Breit-Wigner function together with a
second-order polynomial to describe the background.